# Here, we are loading the relevant libraries used in this project
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
For my final project, I have decided to analyze a Pokemon data set covering Pokemon from the oldest generations through the most recent one in the data. Specifically, this data set contains the characteristics of each Pokemon in the Pokemon video game series, in which players collect their own Pokemon in order to battle other Pokemon. Let’s load the data set and then take a look at the first few rows.
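# stringsAsFactors = TRUE keeps text columns as factors on R 4.0 and later, where this is no longer the default
pokemon <- read.csv("Pokemon.csv", stringsAsFactors = TRUE)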
head(pokemon)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## Sp..Def Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 120 80 1 False
## 5 50 65 1 False
## 6 65 80 1 False
Before performing any data analysis, it’s important to explain the meaning of the different columns in the data set that we are examining. To begin, the first column, shown as “X.” in the output above, is the ID of the Pokemon. Notice that some IDs repeat, because the Mega version of a Pokemon is still considered to be the same entity as the original Pokemon. For example, both Venusaur and VenusaurMega Venusaur have ID 3 because they are the same Pokemon.
The Type 1 column is the primary “type” of a Pokemon, and the Type 2 column is its second “type”. Every Pokemon has at least one type, and can also have a second one. It is important to note that Pokemon without a second type have an empty string (“”) as the value in the Type 2 column. Pokemon types determine a Pokemon’s strengths and weaknesses against the attacks that it receives.
The HP column is how many health points a Pokemon has, the Attack column is how many attack points it has, the Defense column is how many defense points it has, the Special Attack column is how many special attack points it has, the Special Defense column is how many special defense points it has, and the Speed column is how many speed points it has. The Total column is the sum of these six stats, and is an indication of how strong or powerful a Pokemon is. For example, Bulbasaur’s Total of 318 is 45 + 49 + 49 + 65 + 65 + 45.
The Legendary column indicates whether or not a Pokemon is a legendary Pokemon, and the Generation column indicates which generation the Pokemon is from. However, because there are only six generations of Pokemon in the data, it is a good idea to convert the Generation column into a categorical variable, which we’ll do in the code below.
To learn more about the general Pokemon series, refer to: https://en.wikipedia.org/wiki/Pok%C3%A9mon_(video_game_series)
To learn more about different Pokemon types, refer to: https://bulbapedia.bulbagarden.net/wiki/Type
To learn more about the different characteristics of Pokemon, refer to: https://bulbapedia.bulbagarden.net/wiki/Statistic
To learn more about what legendary Pokemon are, refer to: https://bulbapedia.bulbagarden.net/wiki/Legendary_Pok%C3%A9mon
To learn more about what Pokemon generations are, refer to: http://pokemon.wikia.com/wiki/Generation
# Here, we are making the generation attribute into a categorical variable since Pokemon can only be from 6
# different generations
pokemon$Generation <- as.factor(pokemon$Generation)
head(pokemon)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## Sp..Def Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 120 80 1 False
## 5 50 65 1 False
## 6 65 80 1 False
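As a small illustration of what the conversion does, the sketch below applies as.factor to a made-up vector of generation numbers. The values are hypothetical and are not read from Pokemon.csv. Once the column is a factor, R treats each generation as a discrete level rather than as a number on a scale.

```r
# Hypothetical generation values standing in for pokemon$Generation
generation <- c(1, 1, 3, 6, 2, 3)

# as.factor() turns the numbers into discrete category labels
generation_factor <- as.factor(generation)

# The levels are the distinct values, sorted and stored as text
levels(generation_factor)

# table() counts how many entries fall in each level
generation_counts <- table(generation_factor)
generation_counts
```

This is why the bar chart and box plots later on get one bar or box per generation instead of a continuous axis.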
Generations Analysis
This section is devoted to looking at how good the characteristics of Pokemon are across different generations to get an idea of which generation has the strongest and best Pokemon.
We’re going to start off the analysis portion of this project by simply looking at the number of Pokemon from each generation just to get an idea of how the data is distributed across different generations.
ggplot(pokemon, aes(x=Generation)) +
geom_bar(fill="#ffd866", colour="black") +
labs(x="Generation", y="Number of Pokemon", title= "Number of Pokemon Across Generations")
It looks like most Pokemon are from the first, third and fifth generation with fewer Pokemon from the generations in between.
Next, I am going to look at the number of legendary Pokemon from each generation. I am looking at this because legendary Pokemon are known to be powerful, so I suspect that the best generation of Pokemon would also have the highest number of legendary Pokemon in it.
# Filtering the dataset to contain only the legendary pokemons
only_legendaries <- pokemon %>% filter(Legendary == "True")
## Warning: package 'bindrcpp' was built under R version 3.4.4
ggplot(only_legendaries, aes(x=Generation)) +
geom_bar(fill="#66d6ff", colour="black") +
labs(x="Generation", y="Number of Legendary Pokemon", title= "Number of Legendary Pokemon Across Generations")
However, the generations differ in size, so to compare legendaries across generations it is better to use the proportion of legendary Pokemon in each generation rather than the raw count.
numberLegendaryPerGeneration <- pokemon %>%
group_by(Generation) %>%
summarize(numberLegendary= sum(Legendary == "True"), numberInGeneration= n()) %>%
mutate(ratioOfLegendary= numberLegendary / numberInGeneration)
numberLegendaryPerGeneration %>%
ggplot(mapping=aes(y=ratioOfLegendary, x=Generation)) +
geom_point() +
labs(x= "Generation", y= "Ratio of Legendary", title= "Ratio of Legendaries Across Generations")
Judging by the scatterplot above, it looks like generations 3, 4, and 6 have the highest proportion of legendary Pokemon, leading me to anticipate that the best generation could be one of these.
I am now going to create box plots across different generations to see how they compare in the different characteristics described above.
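The ratio computed above is the number of legendaries divided by the generation size, which is the same as the group mean of a TRUE/FALSE indicator. A minimal base R sketch on made-up data (the ten rows below are hypothetical, not taken from the data set) shows the arithmetic:

```r
# Hypothetical toy data: ten Pokemon spread over three generations
generation <- factor(c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3))
legendary <- c("False", "False", "False", "True", "False", "False",
               "True", "True", "False", "False")

# The mean of a logical vector is the share of TRUE values,
# so this gives the proportion of legendaries per generation
legendary_ratio <- tapply(legendary == "True", generation, mean)
legendary_ratio
```

In this toy example generation 1 has more rows than generation 2 but only one legendary, which is the kind of size difference that raw counts would hide.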
# Boxplot of HP across different generations
pokemon %>%
ggplot(mapping= aes(x= Generation, y= HP)) +
geom_boxplot() +
labs(title= "HP Across Generations")
# Boxplot of Attack across different generations
pokemon %>%
ggplot(mapping= aes(x= Generation, y= Attack)) +
geom_boxplot() +
labs(title= "Attack Across Generations")
# Boxplot of Defense across different generations
pokemon %>%
ggplot(mapping= aes(x= Generation, y= Defense)) +
geom_boxplot() +
labs(title= "Defense Across Generations")
# Boxplot of Special Attack across different generations
pokemon %>%
ggplot(mapping= aes(x= Generation, y= Sp..Atk)) +
geom_boxplot() +
labs(title= "Special Attack Across Generations")
# Boxplot of Special Defense across different generations
pokemon %>%
ggplot(mapping= aes(x= Generation, y= Sp..Def)) +
geom_boxplot() +
labs(title= "Special Defense Across Generations")
# Boxplot of speed across different generations
pokemon %>%
ggplot(mapping= aes(x= Generation, y= Speed)) +
geom_boxplot() +
labs(title= "Speed Across Generations")
# Boxplot of total across different generations
pokemon %>%
ggplot(mapping= aes(x= Generation, y= Total)) +
geom_boxplot() +
labs(title= "Total Across Generations")
Looking at the boxplots for the various generations, it appears that the stats are distributed rather evenly across generations. In other words, all generations seem to have been created equal, and there isn’t a generation that is clearly superior to another one. These results make sense, however, because each Pokemon game includes Pokemon from its own generation and from earlier ones, and it would only be fair to keep the overall strength of newer generations about the same as that of older ones so that the game stays competitive.
This conclusion is important because players can now be confident that they won’t be put at a disadvantage just because their Pokemon is from a certain generation; as far as these stats show, Pokemon across the generations are roughly created equal.
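The visual impression from the boxplots can be backed up with a numeric summary, since the thick line in each box is the group median. The sketch below computes the median Total per generation with base R on a made-up data frame; toy_stats and its values are hypothetical, and on the real data the same call would use data = pokemon.

```r
# Hypothetical toy data: three Pokemon in each of two generations
toy_stats <- data.frame(
  Generation = factor(c(1, 1, 1, 2, 2, 2)),
  Total = c(300, 400, 500, 310, 405, 520)
)

# One median per generation, matching the center line of each boxplot
median_total <- aggregate(Total ~ Generation, data = toy_stats, FUN = median)
median_total
```

Medians that sit close together across generations would support the "created equal" reading; large gaps would argue against it.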
Legendary Pokemon Analysis
As I mentioned earlier, legendary Pokemon are known to be stronger than just the regular Pokemon, so now I want to compare the violin plots of legendary Pokemon versus non-legendaries in order to see the difference between their stats.
# Violin plot comparing Legendary and Non-Legendary HP
pokemon %>%
ggplot(aes(x=Legendary, y=HP)) +
geom_violin() +
labs(title="Legendary Vs. Non-Legendary HP",
x = "Legendary",
y = "HP")
# Violin plot comparing Legendary and Non-Legendary Attack
pokemon %>%
ggplot(aes(x=Legendary, y=Attack)) +
geom_violin() +
labs(title="Legendary Vs. Non-Legendary Attack",
x = "Legendary",
y = "Attack")
# Violin plot comparing Legendary and Non-Legendary Defense
pokemon %>%
ggplot(aes(x=Legendary, y=Defense)) +
geom_violin() +
labs(title="Legendary Vs. Non-Legendary Defense",
x = "Legendary",
y = "Defense")
# Violin plot comparing Legendary and Non-Legendary Special Attack
pokemon %>%
ggplot(aes(x=Legendary, y=Sp..Atk)) +
geom_violin() +
labs(title="Legendary Vs. Non-Legendary Special Attack",
x = "Legendary",
y = "Special Attack")
# Violin plot comparing Legendary and Non-Legendary Special Defense
pokemon %>%
ggplot(aes(x=Legendary, y=Sp..Def)) +
geom_violin() +
labs(title="Legendary Vs. Non-Legendary Special Defense",
x = "Legendary",
y = "Special Defense")
# Violin plot comparing Legendary and Non-Legendary Speed
pokemon %>%
ggplot(aes(x=Legendary, y=Speed)) +
geom_violin() +
labs(title="Legendary Vs. Non-Legendary Speed",
x = "Legendary",
y = "Speed")
# Violin plot comparing Legendary and Non-Legendary Total
pokemon %>%
ggplot(aes(x=Legendary, y=Total)) +
geom_violin() +
labs(title="Legendary Vs. Non-Legendary Total",
x = "Legendary",
y = "Total")
Judging by the violin plots above, it is clear that legendary Pokemon are distributed around a higher score for every statistic compared to regular Pokemon, especially in the “Total” statistic mentioned earlier. These results make sense because certain Pokemon are classified as legendary precisely to indicate that they are rare and powerful. The violin plots reassure Pokemon players that legendary Pokemon are indeed noticeably better than regular Pokemon, and that having legendary Pokemon can definitely strengthen your team.
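The violin plots compare the two groups by eye. One hedged way to check such a gap numerically is a rank-based two-sample test, sketched below with base R's wilcox.test on made-up Total values. The data frame toy_pokemon is hypothetical, and this test is my own addition rather than part of the original analysis.

```r
# Hypothetical toy data: five legendary and six regular Pokemon
toy_pokemon <- data.frame(
  Legendary = c(rep("True", 5), rep("False", 6)),
  Total = c(580, 600, 680, 700, 720, 300, 320, 405, 410, 500, 525)
)

# Wilcoxon rank-sum test: do the two groups differ in Total?
rank_test <- wilcox.test(Total ~ Legendary, data = toy_pokemon)

# A small p-value means the gap is unlikely under "no difference"
rank_test$p.value
```

On the real data the same formula could be applied to each stat in turn, keeping in mind that the real columns contain ties, so R would fall back to an approximate p-value.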
Mega Pokemon Versus Legendaries
We know that legendary Pokemon are far stronger than other Pokemon from the data analysis above. However, there are other types of Pokemon in the game called mega Pokemon, which are essentially an upgraded and more powerful version of an existing Pokemon. I thought that comparing legendaries against mega Pokemon across the different stats would be an interesting comparison since both of these types of Pokemon are known to be powerful to begin with.
To learn more about mega Pokemon: https://bulbapedia.bulbagarden.net/wiki/Mega_Evolution
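First, I am creating a new column that indicates whether a Pokemon is Legendary, Mega, or neither. Because the Legendary check runs first, a Pokemon that is both Mega and legendary is labeled Legendary.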
# Creating a new column called Special, that can either take on the values of Legendary, Mega, or Neither
special_pokemon <- pokemon %>%
mutate(Special= ifelse(Legendary == "True", "Legendary", ifelse(str_detect(Name, ".+Mega.+"), "Mega", "Neither")))
# Making that new column into a categorical variable
special_pokemon$Special <- as.factor(special_pokemon$Special)
head(special_pokemon)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## Sp..Def Speed Generation Legendary Special
## 1 65 45 1 False Neither
## 2 80 60 1 False Neither
## 3 100 80 1 False Neither
## 4 120 80 1 False Mega
## 5 50 65 1 False Neither
## 6 65 80 1 False Neither
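To make the labeling rule concrete, here is a small base R sketch of the same nested ifelse on made-up inputs. It uses grepl in place of str_detect so that it runs without extra packages, and the names and legendary flags below are hypothetical test values rather than rows read from the data set.

```r
# Hypothetical names and legendary flags used only to exercise the rule
toy_names <- c("Venusaur", "VenusaurMega Venusaur", "Mewtwo",
               "MewtwoMega Mewtwo X", "Meganium")
toy_legendary <- c("False", "False", "True", "True", "False")

# Legendary is checked first; only non-legendaries can be labeled Mega.
# The pattern needs at least one character before "Mega", so a name
# that merely starts with "Mega" does not match.
special <- ifelse(
  toy_legendary == "True", "Legendary",
  ifelse(grepl(".+Mega.+", toy_names), "Mega", "Neither")
)

data.frame(Name = toy_names, Special = special)
```

Two things stand out: the Mega form flagged as legendary ends up in the Legendary group, and the leading `.+` in the pattern keeps a name like "Meganium" out of the Mega group.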
Next, I compare the density plots of Legendary, Mega, and regular Pokemon across the different stats.
# Comparing HP
special_pokemon %>%
ggplot(aes(x= HP, fill= Special)) +
geom_density(alpha= 0.5) +
labs(x= "HP", y= "Density", title= "HP of Mega vs Legendary vs Neither")
# Comparing Attack
special_pokemon %>%
ggplot(aes(x= Attack, fill= Special)) +
geom_density(alpha= 0.5) +
labs(x= "Attack", y= "Density", title= "Attack of Mega vs Legendary vs Neither")
# Comparing Defense
special_pokemon %>%
ggplot(aes(x= Defense, fill= Special)) +
geom_density(alpha= 0.5) +
labs(x= "Defense", y= "Density", title= "Defense of Mega vs Legendary vs Neither")
# Comparing Special Attack
special_pokemon %>%
ggplot(aes(x= Sp..Atk, fill= Special)) +
geom_density(alpha= 0.5) +
labs(x= "Special Attack", y= "Density", title= "Special Attack of Mega vs Legendary vs Neither")
# Comparing Special Defense
special_pokemon %>%
ggplot(aes(x= Sp..Def, fill= Special)) +
geom_density(alpha= 0.5) +
labs(x= "Special Defense", y= "Density", title= "Special Defense of Mega vs Legendary vs Neither")
# Comparing Speed
special_pokemon %>%
ggplot(aes(x= Speed, fill= Special)) +
geom_density(alpha= 0.5) +
labs(x= "Speed", y= "Density", title= "Speed of Mega vs Legendary vs Neither")
# Comparing Total
special_pokemon %>%
ggplot(aes(x= Total, fill= Special)) +
geom_density(alpha= 0.5) +
labs(x= "Total", y= "Density", title= "Total of Mega vs Legendary vs Neither")
Judging by the density plots comparing Legendary, Mega, and regular Pokemon, it’s clear that Mega and Legendary Pokemon are superior to regular Pokemon in every statistic. The contest between Mega and Legendary Pokemon is much closer. Mega Pokemon are superior to legendary Pokemon in Attack and Defense, but fall short of them in all other statistics.
These density plots show that although Mega Pokemon are a stronger version of an already existing Pokemon, overall they still fall short of legendary Pokemon. These results show just how powerful and dominant legendary Pokemon are, and how having legendary Pokemon on your team could be a big plus.
Notable Pokemon
We’ve spent a lot of time comparing the skill levels of different groups of Pokemon, but now I’m going to look at just the top 5 Pokemon in each statistic to give players a better idea of which specific Pokemon really stand out.
# Top 5 HP
pokemon %>%
arrange(desc(HP)) %>%
slice(1:5) %>%
ggplot(aes(x=reorder(Name, HP), y=HP)) +
geom_bar(stat="identity", fill="#D84595", colour="black") +
coord_flip() +
labs(x="Name", title="Top 5 HP Pokémon")
# Top 5 Attack
pokemon %>%
arrange(desc(Attack)) %>%
slice(1:5) %>%
ggplot(aes(x=reorder(Name, Attack), y=Attack)) +
geom_bar(stat="identity", fill="#D84595", colour="black") +
coord_flip() +
labs(x="Name", title="Top 5 Attack Pokémon")
# Top 5 Defense
pokemon %>%
arrange(desc(Defense)) %>%
slice(1:5) %>%
ggplot(aes(x=reorder(Name, Defense), y=Defense)) +
geom_bar(stat="identity", fill="#D84595", colour="black") +
coord_flip() +
labs(x="Name", title="Top 5 Defense Pokémon")
# Top 5 Special Attack
pokemon %>%
arrange(desc(Sp..Atk)) %>%
slice(1:5) %>%
ggplot(aes(x=reorder(Name, Sp..Atk), y=Sp..Atk)) +
geom_bar(stat="identity", fill="#D84595", colour="black") +
coord_flip() +
labs(x="Name", title="Top 5 Special Attack Pokémon")
# Top 5 Special Defense
pokemon %>%
arrange(desc(Sp..Def)) %>%
slice(1:5) %>%
ggplot(aes(x=reorder(Name,Sp..Def), y= Sp..Def)) +
geom_bar(stat="identity", fill="#D84595", colour="black") +
coord_flip() +
labs(x="Name", title="Top 5 Special Defense Pokémon")
# Top 5 Speed
pokemon %>%
arrange(desc(Speed)) %>%
slice(1:5) %>%
ggplot(aes(x=reorder(Name,Speed), y= Speed)) +
geom_bar(stat="identity", fill="#D84595", colour="black") +
coord_flip() +
labs(x="Name", title="Top 5 Speed Pokémon")
# Top 5 Total
pokemon %>%
arrange(desc(Total)) %>%
slice(1:5) %>%
ggplot(aes(x=reorder(Name,Total), y= Total)) +
geom_bar(stat="identity", fill="#D84595", colour="black") +
coord_flip() +
labs(x="Name", title="Top 5 Total Pokémon")
Judging by the top 5 lists of Pokemon in each category, it looks like Mega Rayquaza is a very strong Pokemon: it is ranked first in Total, and it also appears in the top 5 in Special Attack and Attack. Mega Mewtwo X and Y also appear to be extremely powerful Pokemon, appearing in the top 5 in Total, Special Attack, and Attack. Overall, it looks like Mega Rayquaza, Mega Mewtwo X, and Mega Mewtwo Y are very powerful attacking Pokemon with high Total stats, which makes me believe that these three could arguably be the best three Pokemon in the game.
These lists of top 5 Pokemon could be interesting for players of the game when they are debating who the best Pokemon is in the game, and it would be interesting to see if there is a clear and definitive “best” Pokemon in the game, and what the criteria and credentials would be in order to determine that. Debating the greatest of all time has always been an engaging debate in the world of sports, and I’m sure it’s not different here in the game of Pokemon.
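The top 5 charts above all follow the same pattern: sort by a stat in descending order, then keep the first rows. A minimal base R sketch of that pattern on made-up data is shown below; toy_hp and its values are hypothetical, and it keeps the top 3 only to stay short.

```r
# Hypothetical toy data: six Pokemon with made-up HP values
toy_hp <- data.frame(
  Name = c("A", "B", "C", "D", "E", "F"),
  HP = c(50, 255, 190, 250, 170, 165),
  stringsAsFactors = FALSE
)

# Sort rows by HP from highest to lowest, then keep the first three
top3_hp <- head(toy_hp[order(-toy_hp$HP), ], 3)
top3_hp
```

One design point to keep in mind is that cutting at a fixed number of rows breaks ties by row order, so if several Pokemon share the fifth-highest value, only some of them make the chart.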
Machine Learning
In this last section, I will use machine learning, specifically a random forest, to predict whether or not a Pokemon is legendary based on its different statistics.
First, I split the data into training and testing sets.
set.seed(1234)
# Splitting up the data into the testing and training data with 20 percent of the data going into the
# testing data and 80 percent going into the training data
test_random_forest_df <- pokemon %>%
sample_frac(.2)
train_random_forest_df <- pokemon %>%
anti_join(test_random_forest_df, by="Name")
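The split above removes the test rows from the training set by matching on Name, which relies on every name being unique. As a hedged alternative, the sketch below splits by row index in base R, so it does not depend on any column being unique. The data frame toy_pokemon is hypothetical and only stands in for the real data.

```r
set.seed(1234)

# Hypothetical toy data: ten Pokemon with made-up Total values
toy_pokemon <- data.frame(
  Name = paste0("Pokemon", 1:10),
  Total = seq(300, 750, by = 50),
  stringsAsFactors = FALSE
)

# Draw 20 percent of the row numbers for the test set
n_rows <- nrow(toy_pokemon)
test_rows <- sample(seq_len(n_rows), size = round(0.2 * n_rows))

# Rows drawn go to the test set; all remaining rows go to training
test_df <- toy_pokemon[test_rows, ]
train_df <- toy_pokemon[-test_rows, ]

# Check that the two sets cover the data without overlapping
c(test = nrow(test_df), train = nrow(train_df))
```

Either approach gives an 80/20 split; the index version simply makes the "no row in both sets" guarantee independent of the data.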
Here, we are building the random forest model and also showing the model error.
# Building the random forest model
rf_model <- randomForest(Legendary~., data=train_random_forest_df %>% select(-Name, -X.))
# Showing the model error
plot(rf_model, ylim=c(0,0.45))
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)
The black line shows the overall out-of-bag error rate, which is below 10%, and the red and green lines show the error rates for “Not a Legendary” and “Legendary” respectively. Looking at the graph, we see that our model is much more successful at predicting the Pokemon that are not legendary than the ones that are.
I am now going to look at the importance of the different statistics in predicting whether or not a Pokemon is legendary. We look at relative importance by plotting the mean decrease in Gini calculated across all of the different trees.
# Getting the importance
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,'MeanDecreaseGini'],2))
# Based on the importance, creating a "rank" variable
rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Plotting the relative importance of the different statistics of Pokemon
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'green') +
labs(x = 'Variable') +
coord_flip()
After ranking the importance of each statistic, we can clearly see that the most important statistic for predicting whether or not a Pokemon is legendary is Total (which is just the sum of all the other stats). This makes sense because Total is supposed to indicate the overall strength of a Pokemon, so examining a Pokemon’s overall strength can really help predict whether it is legendary or not.
It’s also interesting to see that the second most useful predictor wasn’t one of a Pokemon’s stats like Speed or Special Attack, but instead its first type. This leads me to think that certain types of Pokemon may be stronger than others, unlike the generations, which are about equally strong. You can further see that Generation isn’t useful for distinguishing which Pokemon are better than others, because it’s the least useful variable for predicting which Pokemon are legendary.
Next, we make predictions on the held-out test set and examine the error rate.
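For readers unfamiliar with the “mean decrease in Gini” used in the ranking above, the sketch below works through the underlying quantity for a single split: Gini impurity before the split minus the size-weighted impurity after it. The class counts and the split are hypothetical, and this illustrates the idea rather than the randomForest package's exact internal computation, which accumulates such decreases over every split on a variable and averages over all trees.

```r
# Gini impurity of a node from its class counts: 1 - sum of squared class shares
gini_impurity <- function(class_counts) {
  shares <- class_counts / sum(class_counts)
  1 - sum(shares^2)
}

# Hypothetical node with 8 non-legendary and 2 legendary Pokemon
parent <- c(False = 8, True = 2)

# Hypothetical split on Total: a low-Total child and a high-Total child
left <- c(False = 7, True = 0)
right <- c(False = 1, True = 2)

# Impurity after the split is the size-weighted average of the children
weighted_children <- (sum(left) * gini_impurity(left) +
                        sum(right) * gini_impurity(right)) / sum(parent)

# The decrease is what a split on this variable contributes
gini_decrease <- gini_impurity(parent) - weighted_children
gini_decrease
```

A variable that produces many splits with large decreases, as Total does here, ends up near the top of the importance plot.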
test_predictions <- predict(rf_model, newdata=test_random_forest_df %>% select(-Name, -X.))
confusion_matrix <- table(pred=test_predictions, observed=test_random_forest_df$Legendary)
error <- mean(test_random_forest_df$Legendary != test_predictions)
error
## [1] 0.01875
After making predictions on the testing data, we see that the error rate is very small: 0.01875, or just under 2%.
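The code above builds a confusion matrix but only reports the overall error. Because legendary Pokemon are a small share of the data, the overall error can look good even when the rare class is predicted poorly, so it helps to look at the error per class as well. The base R sketch below shows this on made-up labels; the observed and predicted vectors are hypothetical and are not the model's actual output.

```r
class_levels <- c("False", "True")

# Hypothetical labels: eight regular and two legendary Pokemon
observed <- factor(c(rep("False", 8), "True", "True"), levels = class_levels)

# Hypothetical predictions that miss one of the two legendaries
predicted <- factor(c(rep("False", 8), "False", "True"), levels = class_levels)

# Rows are predictions, columns are the true classes
confusion <- table(pred = predicted, observed = observed)
confusion

# Share of all predictions that are wrong
overall_error <- mean(predicted != observed)

# Share of each true class that was predicted wrongly
per_class_error <- 1 - diag(confusion) / colSums(confusion)
per_class_error
```

Here the overall error is only 10 percent, yet half of the legendary Pokemon are missed, which mirrors the gap between the class error curves in the model error plot.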